Table of Contents ¶

  • Introduction
  • Data Inspection
    • Data Loading
    • Check for Missing Values
    • Rename Columns
    • View Data Information
    • Show Information of All Categorical Features
    • Categorical Digitization
  • Data Analysis
    • Descriptive Analysis
    • Employee Turnover Ratio
    • Relationship Between Employee Satisfaction and Years Spent at Company
    • Comparison of Employee Turnover by Department
    • Employee Salary Levels by Department
    • Impact of Employee Salary on Turnover Rate
    • Comparison of Employee Promotions in Last 5 Years and Turnover
    • Comparison of Employee Performance Evaluation and Turnover
  • Modeling
    • Correlation Analysis
    • Prepare Training and Testing Sets
    • RandomForestClassifier
    • XGBoost
    • LogisticRegression
    • Summary

Introduction¶

The Human Resources Analytics project uses a simulated dataset to understand why the best and most experienced employees are leaving the company.

By analyzing factors like salary, department, and employee satisfaction, this project aims to identify patterns contributing to employee attrition. The insights gained will help HR and management develop effective strategies to improve employee retention and foster a more productive work environment.

Data Source: https://www.kaggle.com/datasets/colara/human-resource/data

Purpose

Data Analysis:

Descriptive Analysis: To provide an overview of employee turnover metrics such as turnover ratio, departmental comparisons, and trends.

Relationship Exploration: To investigate correlations between employee satisfaction, tenure, and turnover, identifying key factors influencing attrition.

Salary and Promotion Impact: To assess the impact of salary levels and promotion opportunities on employee turnover.

Modeling:

Correlation Analysis: To identify significant predictors of employee turnover using statistical methods, guiding feature selection for predictive models.

Predictive Modeling: To develop RandomForestClassifier, XGBoost, and LogisticRegression models to predict employee turnover. Evaluate models based on accuracy and AUC to inform decision-making and intervention strategies.

Data Inspection¶

In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, FactorRange, HoverTool
from bokeh.palettes import Category20
from bokeh.plotting import figure, show
from bokeh.resources import settings
from bokeh.transform import dodge

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder

settings.resources = 'inline'

Data Loading¶

In [ ]:
# data loading

data = pd.read_csv("HR_comma_sep.csv")
In [ ]:
data.head()
Out[ ]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low

Check for Missing Values¶

In [ ]:
data.isnull().sum()
Out[ ]:
satisfaction_level       0
last_evaluation          0
number_project           0
average_montly_hours     0
time_spend_company       0
Work_accident            0
left                     0
promotion_last_5years    0
sales                    0
salary                   0
dtype: int64
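
This dataset has no missing values, so no imputation is needed. For datasets that do contain gaps, a minimal guard might look like the sketch below (on a small hypothetical frame, not the HR data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; the HR data itself has none
df = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.72],
    "salary": ["low", "medium", None],
})

# Fill numeric gaps with the column median, drop rows missing a category
df["satisfaction_level"] = df["satisfaction_level"].fillna(
    df["satisfaction_level"].median())
df = df.dropna(subset=["salary"])

print(df.isnull().sum().sum())  # 0
```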

Rename Columns¶

In [ ]:
# Renaming the 'sales' column to 'department'
data = data.rename(columns={"sales": "department"})
data.columns
Out[ ]:
Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'department', 'salary'],
      dtype='object')

View Data Information¶

In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   department             14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

Show Information of All Categorical Features¶

In [ ]:
data.describe(include=['O'])
Out[ ]:
department salary
count 14999 14999
unique 10 3
top sales low
freq 4140 7316

Categorical Digitization¶

In [ ]:
# Convert 'salary' and 'department' to the categorical data type
data['salary'] = data['salary'].astype('category')
data['department'] = data['department'].astype('category')

# Mapping dictionaries from category codes to labels
salary_dict = dict(enumerate(data['salary'].cat.categories))
department_dict = dict(enumerate(data['department'].cat.categories))

# Reverse mappings from labels back to codes
salary_dict_reverse = {v: k for k, v in salary_dict.items()}
department_dict_reverse = {v: k for k, v in department_dict.items()}

# New columns holding the numerical category codes
data['salary_num'] = data['salary'].cat.codes
data['department_num'] = data['department'].cat.codes
In [ ]:
data.head()
Out[ ]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years department salary salary_num department_num
0 0.38 0.53 2 157 3 0 1 0 sales low 1 7
1 0.80 0.86 5 262 6 0 1 0 sales medium 2 7
2 0.11 0.88 7 272 4 0 1 0 sales medium 2 7
3 0.72 0.87 5 223 5 0 1 0 sales low 1 7
4 0.37 0.52 2 159 3 0 1 0 sales low 1 7
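
The codes above follow the alphabetical order of the categories, which is why 'low' maps to 1 and 'medium' to 2 ('high' is 0). A minimal sketch of the encode/decode round trip, on a toy column rather than the full dataset:

```python
import pandas as pd

# Toy salary column; categories sort alphabetically: high=0, low=1, medium=2
salary = pd.Series(["low", "medium", "low", "high"]).astype("category")
mapping = dict(enumerate(salary.cat.categories))

codes = salary.cat.codes          # numeric codes, as in salary_num
decoded = codes.map(mapping)      # map the codes back to labels

print(mapping)                    # {0: 'high', 1: 'low', 2: 'medium'}
print(decoded.tolist())           # ['low', 'medium', 'low', 'high']
```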

Data Analysis¶

Descriptive Analysis¶

In [ ]:
# Exclude categorical columns for mean calculations
columns_to_exclude = ['department', 'salary']
numeric_df = data.drop(columns=columns_to_exclude)
In [ ]:
# Mean of every numeric column across all employees
numeric_df.mean()
Out[ ]:
satisfaction_level         0.612834
last_evaluation            0.716102
number_project             3.803054
average_montly_hours     201.050337
time_spend_company         3.498233
Work_accident              0.144610
left                       0.238083
promotion_last_5years      0.021268
salary_num                 1.347290
department_num             5.870525
dtype: float64
In [ ]:
numeric_df.describe()
Out[ ]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years salary_num department_num
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268 1.347290 5.870525
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281 0.625819 2.868786
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000 1.000000 4.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000 1.000000 7.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000 2.000000 8.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000 2.000000 9.000000

Employee Turnover Ratio¶

In [ ]:
# Count the number of employees who stayed and left
left_count = numeric_df['left'].value_counts().reset_index(name="count")

# Plotting a pie chart to show the turnover ratio
labels = ['Stayed', 'Left']
sizes = left_count['count']
colors = ['#1f77b4', '#ff7f0e']  # Blue and orange colors
explode = (0, 0.1)  # Explode the 'Left' slice

plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, colors=colors, explode=explode,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.title("Employee Turnover Ratio")
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
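
The same ratio can be read off numerically with value_counts; a sketch on a toy column (on the real frame, data['left'].value_counts(normalize=True) gives the pie chart's fractions):

```python
import pandas as pd

# Toy 'left' column: three stayers, one leaver
left = pd.Series([0, 0, 0, 1])
ratio = left.value_counts(normalize=True)

print(ratio[0])  # 0.75 stayed
print(ratio[1])  # 0.25 left
```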

Relationship Between Employee Satisfaction and Years Spent at Company¶

In [ ]:
# Calculate average satisfaction level based on years spent at the company
time_mean_satisfaction = numeric_df.groupby('time_spend_company')['satisfaction_level'].mean().reset_index(name="average_satisfaction")

# Plotting a bar chart to show satisfaction vs. years spent at the company
plt.figure(figsize=(10, 6))
plt.bar(time_mean_satisfaction['time_spend_company'], time_mean_satisfaction['average_satisfaction'], color='#1f77b4')
plt.title("Employee Satisfaction vs. Years Spent at Company")
plt.xlabel("Years at Company")
plt.ylabel("Average Satisfaction")
plt.grid(True)
plt.show()

Comparison of Employee Turnover by Department¶

In [ ]:
# Crosstab of turnover counts per department, reshaped for Bokeh
depart_left_table = pd.crosstab(index=numeric_df['department_num'], columns=numeric_df['left'])
depart_left_table.reset_index(inplace=True)

# Map numerical department values to categorical names
departments = [department_dict[dept] for dept in depart_left_table['department_num']]
stayed = depart_left_table[0].tolist()
left = depart_left_table[1].tolist()

data_left = {'departments': departments,
        'Stayed': stayed,
        'Left': left}

source = ColumnDataSource(data=data_left)

# Create the figure
p = figure(y_range=FactorRange(*departments), height=400, width=800,
           title="Employee Turnover by Department", toolbar_location=None, tools="")

# Add bars
p.hbar(y=dodge('departments', -0.15, range=p.y_range), right='Stayed', height=0.3, source=source,
       color=Category20[20][0], legend_label="Stayed")
p.hbar(y=dodge('departments', 0.15, range=p.y_range), right='Left', height=0.3, source=source,
       color=Category20[20][1], legend_label="Left")

# Customize the plot
p.y_range.range_padding = 0.1
p.ygrid.grid_line_color = None
p.xaxis.axis_label = "Count"
p.yaxis.axis_label = "Department"
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

# Output to notebook
output_notebook()
show(p)

Summary: The sales department has the highest number of employees leaving, with 1,014 people. This is followed by the technical department, with 697 people leaving.
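
The counts behind the chart come from the crosstab; a minimal sketch on hypothetical rows showing how the top-leaving department can be read off:

```python
import pandas as pd

# Hypothetical department/left pairs standing in for the real columns
df = pd.DataFrame({
    "department": ["sales", "sales", "technical", "sales", "technical", "hr"],
    "left":       [1,       1,       1,           0,       0,           0],
})

counts = pd.crosstab(df["department"], df["left"])
leavers = counts[1].sort_values(ascending=False)

print(leavers.idxmax())  # 'sales' has the most leavers in this toy sample
```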

Employee Salary Levels by Department¶

In [ ]:
# Crosstab to compare salary distribution by department
depart_salary_table = pd.crosstab(index=data['department'], columns=data['salary'])

# Ensure all salary levels are present for each department
depart_salary_table = depart_salary_table.reindex(columns=salary_dict.values(), fill_value=0)

# Convert the Crosstab data to Bokeh-compatible format
departments = depart_salary_table.index.tolist()
low_salary = depart_salary_table['low'].tolist()
medium_salary = depart_salary_table['medium'].tolist()
high_salary = depart_salary_table['high'].tolist()

source = ColumnDataSource(data={
    'departments': departments,
    'Low Salary': low_salary,
    'Medium Salary': medium_salary,
    'High Salary': high_salary
})

# Create the figure
p = figure(x_range=departments, height=400, width=800,
           title="Employee Salary Levels by Department", toolbar_location=None, tools="")

# Add vertical bars with hover tooltips
p.vbar_stack(stackers=['Low Salary', 'Medium Salary', 'High Salary'], x='departments', width=0.9, color=['#1f77b4', '#ff7f0e', '#2ca02c'],
             source=source, legend_label=['Low Salary', 'Medium Salary', 'High Salary'])

# Hover tooltip configuration
hover = HoverTool()
hover.tooltips = [
    ('Department', '@departments'),
    ('Low Salary', '@{Low Salary}'),
    ('Medium Salary', '@{Medium Salary}'),
    ('High Salary', '@{High Salary}')
]
p.add_tools(hover)

# Customize the plot
p.y_range.start = 0
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_right"
p.legend.orientation = "vertical"

# Show plot
output_notebook()
show(p)

Summary: The sales department has the largest number of low-salary employees, followed by the technical and support departments.
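
The reindex step above matters because pd.crosstab only emits columns for salary levels that actually occur; a sketch of the guard on toy data:

```python
import pandas as pd

# Toy data in which no employee has a high salary
df = pd.DataFrame({
    "department": ["sales", "sales", "IT"],
    "salary":     ["low",   "low",   "medium"],
})

table = pd.crosstab(df["department"], df["salary"])
table = table.reindex(columns=["low", "medium", "high"], fill_value=0)

print(table.loc["sales", "high"])  # 0: the column exists instead of raising
```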

Impact of Employee Salary on Turnover Rate¶

In [ ]:
# Crosstab to compare turnover by salary level
salary_left_table = pd.crosstab(index=data['salary'], columns=data['left'])

# Plotting bar chart to show turnover rate by salary level
fig, ax = plt.subplots(figsize=(6, 5))
bars = salary_left_table.plot(kind='bar', stacked=False, ax=ax, color=['#1f77b4', '#ff7f0e'])

# Setting legend labels
ax.legend(['Stay', 'Left'])
ax.set_title("Impact of Employee Salary on Turnover")
ax.set_xlabel("Salary")
ax.set_ylabel("")
plt.xticks(rotation=0)

# Adding labels on top of each bar
for bar in bars.patches:
    # Get the height of the bar
    height = bar.get_height()
    # Add text annotation above each bar with its value
    ax.annotate(f'{height}', xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3),  # 3 points vertical offset
                textcoords="offset points",
                ha='center', va='bottom')

plt.show()

Summary: For low- and medium-salary employees the number of leavers relative to stayers is high, at about 42% and 26% respectively, while for high-salary employees it is only about 7%. Turnover falls sharply as salary rises.
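
Per-level rates can be computed directly with a normalized crosstab; a sketch on toy rows (on the real frame, pd.crosstab(data['salary'], data['left'], normalize='index') gives the fraction leaving at each salary level):

```python
import pandas as pd

# Toy salary/left pairs
df = pd.DataFrame({
    "salary": ["low", "low", "low", "medium", "high"],
    "left":   [1,     1,     0,     1,        0],
})

# normalize='index' turns counts into within-row fractions
rates = pd.crosstab(df["salary"], df["left"], normalize="index")

print(rates.loc["low", 1])   # 2/3 of low-salary employees left (toy sample)
print(rates.loc["high", 1])  # 0.0
```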

Comparison of Employee Promotions in Last 5 Years and Turnover¶

In [ ]:
# Crosstab to compare turnover by promotion status
promotion_left_table = pd.crosstab(index=data['promotion_last_5years'], columns=data['left'])
promotion_dict = {0: "No Promotion", 1: "Promotion"}

# Plotting bar chart to show turnover rate by promotion status
fig, ax = plt.subplots(figsize=(6, 6))
promotion_left_table.plot(kind='bar', stacked=False, ax=ax, color=['#1f77b4', '#ff7f0e'])
ax.legend(['Stay', 'Left'])
ax.set_title("Impact of Employee Promotions in Last 5 Years on Turnover")
ax.set_xlabel("Promotion Status")
ax.set_ylabel("Count")
ax.set_xticklabels([promotion_dict[val] for val in promotion_left_table.index], rotation=0)
plt.show()

Summary: Employees who have not been promoted in the past 5 years leave at a much higher rate than those who have; about 94% of promoted employees are still employed.
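
The retention figure is the share of left == 0 among promoted employees; a minimal sketch on toy rows:

```python
import pandas as pd

# Toy promotion/left pairs
df = pd.DataFrame({
    "promotion_last_5years": [1, 1, 1, 1, 0, 0],
    "left":                  [0, 0, 0, 1, 1, 0],
})

promoted = df[df["promotion_last_5years"] == 1]
retention = (promoted["left"] == 0).mean()

print(retention)  # 0.75 of promoted employees stayed in this toy sample
```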

Comparison of Employee Performance Evaluation and Turnover¶

In [ ]:
evaluation_left_table = pd.crosstab(index=data['last_evaluation'], columns=data['left'])

# Plotting bar chart to show turnover rate by performance evaluation
fig, ax = plt.subplots(figsize=(10, 5))
evaluation_left_table.plot(kind='bar', stacked=False, ax=ax, color=['#1f77b4', '#ff7f0e'])
ax.legend(['Stay', 'Left'])
ax.set_title("Impact of Employee Performance Evaluation on Turnover")
ax.set_xlabel("Performance Evaluation")
ax.set_ylabel("Count")
plt.xticks(rotation=90)
plt.show()

Summary: Many of the employees who left the company had received high performance evaluations.
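
Because last_evaluation is nearly continuous, the crosstab above draws one bar per raw score. Binning the scores first (for example with pd.cut, a hypothetical refinement not used above) gives a more readable view; a sketch on toy values:

```python
import pandas as pd

# Toy evaluation scores and outcomes
df = pd.DataFrame({
    "last_evaluation": [0.45, 0.55, 0.92, 0.97, 0.50],
    "left":            [0,    1,    1,    1,    0],
})

# Bucket the scores, then compute the turnover rate per bucket
df["eval_bin"] = pd.cut(df["last_evaluation"],
                        bins=[0.3, 0.6, 0.8, 1.0],
                        labels=["low", "medium", "high"])
rates = df.groupby("eval_bin", observed=True)["left"].mean()

print(rates["high"])  # 1.0: every high scorer left in this toy sample
```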

Modeling¶

Correlation Analysis¶

In [ ]:
corr = numeric_df.corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.tril_indices_from(mask)] = True
In [ ]:
with sns.axes_style("white"):
    sns.set(rc={'figure.figsize':(7,7)})
    ax = sns.heatmap(corr,
                xticklabels=True, yticklabels=True,
                cmap='RdBu',
                mask=mask,
                fmt='.3f',
                annot=True,
                linewidths=.5,
                vmax=.4,
                square = True
                # center = 0
                )

plt.title("Correlation")
label_x = ax.get_xticklabels()
plt.setp(label_x,rotation=45, horizontalalignment='right')
plt.show()

Summary:

Satisfaction Level and Time Spent at the Company appear to have the strongest correlations with Employee Turnover (left).

Number of Projects and Average Monthly Hours also show moderate correlations with each other and with Last Evaluation, suggesting potential workload implications.

Work Accident, Promotion in Last 5 Years, Salary Level, and Department show minimal correlations with other variables in the dataset.
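
The heatmap's key column can also be read off directly by ranking the absolute correlations with left; a sketch on a toy frame:

```python
import pandas as pd

# Toy frame with an obvious structure: satisfaction falls as 'left' rises
df = pd.DataFrame({
    "satisfaction_level": [0.9, 0.8, 0.2, 0.1],
    "time_spend_company": [2, 3, 5, 6],
    "left":               [0, 0, 1, 1],
})

# Rank predictors by the strength of their correlation with the target
corr_with_left = (df.corr()["left"]
                    .drop("left")
                    .abs()
                    .sort_values(ascending=False))

print(corr_with_left.index[0])  # 'satisfaction_level'
```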

Prepare Training and Testing Sets¶

In [ ]:
# Selecting relevant features and target variable
features = ['satisfaction_level', 'last_evaluation', 'number_project',
            'average_montly_hours', 'time_spend_company', 'Work_accident',
            'promotion_last_5years']
target = 'left'

X = data[features]
y = data[target]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

RandomForestClassifier¶

In [ ]:
# Creating the RandomForestClassifier
rdf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = KFold(n_splits=10, shuffle=True, random_state=90)
rdf_score = cross_val_score(rdf, X_train, y_train, cv=cv).mean()

# Training the model
rdf.fit(X_train, y_train)

# Predicting probabilities on the test set
y_pred_prob = rdf.predict_proba(X_test)[:, 1]
rdf_y_pred = rdf.predict(X_test)

# Computing ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)
rdf_auc = roc_auc_score(y_test, y_pred_prob)
In [ ]:
# Plotting ROC curve
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

XGBoost¶

In [ ]:
!pip install xgboost
Requirement already satisfied: xgboost in /usr/local/lib/python3.10/dist-packages (2.0.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from xgboost) (1.25.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from xgboost) (1.11.4)
In [ ]:
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=100, random_state=42)
xgb_score = cross_val_score(xgb, X_train, y_train, cv=cv).mean()
xgb.fit(X_train, y_train)

# Predict probabilities for the test set
y_pred_prob = xgb.predict_proba(X_test)[:, 1]
xgb_y_pred = xgb.predict(X_test)

# Calculate the ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
xgb_auc = roc_auc_score(y_test, y_pred_prob)
In [ ]:
# Plot the ROC curve
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, color='blue', label=f'XGBoost (AUC = {roc_auc_score(y_test, y_pred_prob):.2f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

LogisticRegression¶

In [ ]:
from sklearn.linear_model import LogisticRegression as LR

# Train Logistic Regression model
logisticR = LR(max_iter=1000)
logisticR_score = cross_val_score(logisticR, X_train, y_train, cv=cv).mean()
logisticR.fit(X_train, y_train)

# Predict probabilities for the test set
y_pred_prob = logisticR.predict_proba(X_test)[:, 1]
logisticR_y_pred = logisticR.predict(X_test)

# Calculate the ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
logisticR_auc = roc_auc_score(y_test, y_pred_prob)
In [ ]:
# Plot the ROC curve
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, color='blue', label=f'Logistic Regression (AUC = {roc_auc_score(y_test, y_pred_prob):.2f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

Summary¶

In [ ]:
# Print the classification_report for each model
from sklearn.metrics import classification_report as CR

print('RandomForest'.center(50), CR(y_test, rdf_y_pred), sep='\n')
print('XGBoost'.center(55), CR(y_test, xgb_y_pred), sep='\n')
print('LogisticRegression'.center(50), CR(y_test, logisticR_y_pred), sep='\n')
                   RandomForest                   
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      3428
           1       0.99      0.96      0.98      1072

    accuracy                           0.99      4500
   macro avg       0.99      0.98      0.98      4500
weighted avg       0.99      0.99      0.99      4500

                        XGBoost                        
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      3428
           1       0.98      0.95      0.97      1072

    accuracy                           0.98      4500
   macro avg       0.98      0.97      0.98      4500
weighted avg       0.98      0.98      0.98      4500

                LogisticRegression                
              precision    recall  f1-score   support

           0       0.79      0.92      0.85      3428
           1       0.49      0.24      0.32      1072

    accuracy                           0.76      4500
   macro avg       0.64      0.58      0.59      4500
weighted avg       0.72      0.76      0.73      4500

In [ ]:
score = {'Model_score': [rdf_score, xgb_score, logisticR_score],
         'Auc_area': [rdf_auc, xgb_auc, logisticR_auc]}
score_com = pd.DataFrame(data=score, index=['RandomForest', 'XGBoost', 'LogisticRegression'])
score_com.sort_values(by=['Model_score'], ascending=False)
Out[ ]:
Model_score Auc_area
RandomForest 0.990760 0.991931
XGBoost 0.986855 0.991493
LogisticRegression 0.768929 0.801972

RandomForest demonstrates the best performance in terms of both accuracy and AUC.

XGBoost follows closely behind, making both ensemble methods preferable choices for predicting employee turnover.

LogisticRegression does not perform as well in this particular case and is less suitable for deployment if high predictive accuracy is required.